# Assignment 5: Chicago Food Inspections # MSDS420

**Author:** Atef Bader, Jonathan De Leon (Requirements portion) **Last Edit:** 9/3/2021

Deliverables:

Objectives:

In this assignment, you will:

Submission Formats:

Create a folder or directory containing all supplementary files, with your last name at the beginning of the folder name. Compress that folder with zip compression and post the zip-archived folder under the assignment link in Canvas. The following files should be included in the archive folder/directory that is uploaded as a single zip-compressed file. (Use zip only, not StuffIt, 7z, or any other compression method.)

  1. Complete IPYNB script that has the source code in Python used to access and analyze the data. The code should be submitted as an IPYNB script that can be loaded and run in Jupyter Notebook for Python.
  2. Output from the program, such as console listing/logs, text files, and graphics output for visualizations. If you use the Data Science Computing Cluster or School of Professional Studies database servers or systems, include Linux logs of your sessions as plain text files. Linux logs may be generated by using the script process at the beginning of your session, as demonstrated in tutorial handouts for the DSCC servers.
  3. A list of the file names, with a description of each file, in the zip-compressed folder/directory.

Formatting Python Code

When programming in Python, refer to Kenneth Reitz’ PEP 8: The Style Guide for Python Code (http://pep8.org/). There is also the Google style guide for Python (https://google.github.io/styleguide/pyguide.html). Comment often and in detail.

Assignment Description and Requirement Specifications

Chicago Food Inspections

A recent watchdog report published by the Chicago Tribune indicated that food safety inspectors overlook hundreds of day cares in the city of Chicago.


The key takeaway from the Chicago Tribune watchdog report is that the city had only 33 working field inspectors to cover the entire city of Chicago. Many of the facilities serve food to children, and while few fail inspections, many escape routine inspections.

This is a classic resource allocation problem. In this assignment, our goal is to identify the hot-spots (areas that have facilities serving food to children and have failed inspections in the past) on the Chicago map to dispatch inspectors to.

To achieve our goal, we need the following:

  1. Dataset for Chicago Food Inspections
  2. NoSQL database engine (ElasticSearch) for indexing and data retrieval
  3. HeatMap to plot the children's facilities that failed Chicago Food Inspections

The CSV file for the dataset is obtained from the data portal for the city of Chicago. Here is the link to the data portal: City of Chicago Data Portal


Loading the Dataset CSV file

Let's load the CSV file into a DataFrame object and see the nature of the data that we have.
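As a minimal sketch of this step, the CSV can be read with pandas' `read_csv`. The helper name `load_inspections` and the in-memory sample rows are made up for illustration, and only a few of the 17 columns are shown; in practice you would pass the path of the file downloaded from the data portal.

```python
import io

import pandas as pd

# Toy stand-in for the portal's CSV export. These rows are invented for
# illustration only; the real file has 17 columns.
SAMPLE_CSV = io.StringIO(
    "Inspection ID,Risk,Results,Latitude,Longitude\n"
    "1001,Risk 1 (High),Fail,41.88,-87.63\n"
    "1002,Risk 2 (Medium),Pass,41.90,-87.65\n"
    "1003,Risk 1 (High),Fail,,\n"
)


def load_inspections(source):
    """Read the inspections CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_inspections(SAMPLE_CSV)    # in practice: load_inspections("<your file>.csv")
print(df.shape)                      # (number of rows, number of columns)
print(df.columns.tolist())           # field names
print(df["Results"].value_counts())  # how many records per inspection result
```

Checking `df.shape` and the column list first is a quick way to confirm that the record count and the number of fields agree with the dataset description.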

Description of the dataset:

  1. It has 164953 inspection records
  2. It has inspection records from 2010 to 2018
  3. It has 17 fields

There are a few fields in the dataset that are of interest to us:

  1. Risk
  2. Results
  3. Latitude
  4. Longitude
  5. Inspection ID

We are also interested in any field that mentions (or misspells) the word Children.

The data entry clerks may have made typos and misspellings, and different words may have been used to indicate the same thing. Some examples of this:

To perform different queries to retrieve the relevant inspection records, we will store the dataset in a NoSQL database engine, ElasticSearch.

For more information on ElasticSearch, visit ElasticSearch

Please note that in this version of the assignment, the index for the Chicago food inspections dataset has already been created on ElasticSearch on the DSCC.

ElasticSearch

The three major platforms are supported:

  1. Windows
  2. MacOS
  3. Linux

Startup ElasticSearch Server

After you install ElasticSearch, go to the elasticsearch-6.2.3\bin directory under your installation directory and type the following command at the terminal/command prompt: elasticsearch

elasticsearch package

We need the elasticsearch package to connect to ElasticSearch servers.

To install the elasticsearch package, execute the following command from the command/terminal window: pip install elasticsearch

Load and Index the Inspection Records into ElasticSearch

Inspection records are inserted into the ElasticSearch engine using the bulk API of ElasticSearch.

The API documentation is available at this link: API DOCS
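The bulk load can be sketched roughly as follows, using the `helpers.bulk` function of the elasticsearch Python package. The function names, the index name `food_inspections`, the `doc` mapping type, and the toy records are all assumptions made for illustration; they are not the names used on the DSCC.

```python
def generate_actions(records, index_name, doc_type=None):
    """Yield one bulk action per inspection record (a dict keyed by field name)."""
    for record in records:
        action = {
            "_index": index_name,
            "_id": record["Inspection ID"],  # reuse the dataset's own unique ID
            "_source": record,
        }
        if doc_type is not None:  # ElasticSearch 6.x still expects a mapping type
            action["_type"] = doc_type
        yield action


def index_records(client, records, index_name, doc_type=None):
    """Send all records through the bulk helper; returns (successes, errors)."""
    # Imported lazily so generate_actions can be tried without the package.
    from elasticsearch import helpers

    return helpers.bulk(client, generate_actions(records, index_name, doc_type))


# Invented records, only to show the shape of the generated actions.
toy_records = [
    {"Inspection ID": 1001, "Results": "Fail", "Risk": "Risk 1 (High)"},
    {"Inspection ID": 1002, "Results": "Pass", "Risk": "Risk 2 (Medium)"},
]
actions = list(generate_actions(toy_records, "food_inspections", doc_type="doc"))
print(actions[0])
```

With a DataFrame, the records could be produced with `df.to_dict(orient="records")`. Using the Inspection ID as the document `_id` means that re-running the load overwrites documents instead of duplicating them.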

Query is used to retrieve data from ElasticSearch server

The query is used to retrieve data from ElasticSearch servers that match certain filters.

For information about the syntax and semantics of queries, you can read the docs at the following URL: QUERY DOCS

We will also use the scroll API to retrieve the data matching our query. For more information about scroll, you can read the docs at the following URL: Scroll DOCS
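A rough sketch of scrolling through all matching documents is shown below. It relies on the client's `search`, `scroll`, and `clear_scroll` methods. The helper name `scroll_all`, the page size, and the keep-alive value are illustrative choices, and `FakeClient` exists only so the sketch runs without a server.

```python
def scroll_all(client, index, body, page_size=1000, keep_alive="2m"):
    """Yield every hit matching `body`, fetching `page_size` hits per round trip."""
    response = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = response["_scroll_id"]
    hits = response["hits"]["hits"]
    try:
        while hits:  # an empty page means the result set is exhausted
            yield from hits
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]
            hits = response["hits"]["hits"]
    finally:
        client.clear_scroll(scroll_id=scroll_id)  # release server-side resources


class FakeClient:
    """Stand-in for an Elasticsearch client: serves the given pages, then an empty one."""

    def __init__(self, pages):
        self._pages = list(pages)
        self.cleared = False

    def _next_page(self):
        hits = self._pages.pop(0) if self._pages else []
        return {"_scroll_id": "toy-scroll-id", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        return self._next_page()

    def scroll(self, scroll_id, scroll):
        return self._next_page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeClient([[{"_id": 1}, {"_id": 2}], [{"_id": 3}]])
all_hits = list(scroll_all(fake, "food_inspections", {"query": {"match_all": {}}}))
print(len(all_hits))
```

Against a real server, `fake` would be replaced by an `Elasticsearch(...)` client instance. Writing the helper as a generator keeps only one page of hits in memory at a time.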

We create our query to retrieve the inspection records we are interested in using three experiments, and will compare the results of each:

  1. Experiment #1: Using regular expressions with the term Children
  2. Experiment #2: Using fuzziness with the term Children's
  3. Experiment #3: Using fuzziness with the term Children

Experiment #1: Create the query using regex
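One possible shape for such a query is sketched below, as a `bool` query that combines a `match` on Results with a `regexp` clause. Searching the Facility Type field, the result value "Fail", and the helper name are assumptions for illustration; the assignment's actual query may target other fields. The small `re` check at the end only mimics, in Python, which indexed terms the pattern would cover.

```python
import re


def build_regex_query(field, pattern, results="Fail"):
    """Build a query body: inspections with the given result whose `field` matches `pattern`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"Results": results}},
                    # regexp is applied to indexed terms and must match a whole term;
                    # with a standard analyzer those terms are lowercased.
                    {"regexp": {field: pattern}},
                ]
            }
        }
    }


regex_query = build_regex_query("Facility Type", ".*children.*")

# Illustration only: which of these toy terms does the pattern cover?
toy_terms = ["children", "children's", "grandchildren", "childcare"]
matched_terms = [term for term in toy_terms if re.fullmatch(".*children.*", term)]
print(matched_terms)
```

The leading and trailing `.*` let the pattern match any term that merely contains "children", which is why this kind of query tends to return many more records than an edit-distance based one.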

Process the retrieved documents and filter fields we need for the Heatmap

We need to create a list-of-lists of the two fields (Latitude and Longitude) for the HeatMap.
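A minimal sketch of this filtering step, assuming each retrieved document is a hit dict with its fields under `_source` (the usual shape of ElasticSearch search results). The helper name and the toy hits are made up for illustration.

```python
import math


def extract_coordinates(hits):
    """Return [[latitude, longitude], ...] for hits that carry usable coordinates."""
    coords = []
    for hit in hits:
        source = hit.get("_source", {})
        try:
            lat = float(source.get("Latitude"))
            lon = float(source.get("Longitude"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric coordinate
        if math.isnan(lat) or math.isnan(lon):
            continue  # NaN placeholders cannot be plotted
        coords.append([lat, lon])
    return coords


# Invented hits, only to exercise the helper.
toy_hits = [
    {"_source": {"Latitude": 41.88, "Longitude": -87.63}},
    {"_source": {"Latitude": "41.90", "Longitude": "-87.65"}},
    {"_source": {"Latitude": None, "Longitude": -87.70}},
    {"_source": {"Longitude": -87.60}},
]
coords = extract_coordinates(toy_hits)
print(coords)
```

Records without coordinates are skipped rather than raising an error, since a heatmap cannot plot them anyway.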

We need to install the folium package to plot the map and heatmaps.

The official documentation can be accessed at this URL: Folium

To install the Folium package, execute the following command from the command/terminal window: pip install folium

For the different configuration parameters for HeatMap, you can access the docs at this URL: HeatMap

Create the HeatMap
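A minimal sketch of the plotting step, using `folium.Map` and the `HeatMap` plugin from `folium.plugins`. The helper names, the zoom level, and the radius are illustrative choices rather than the assignment's settings. Centering the map on the average coordinate is also an assumption; a fixed Chicago coordinate would work just as well.

```python
def map_center(coords):
    """Average latitude/longitude, used to center the map on the data."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


def build_heatmap(coords, zoom_start=11, radius=10):
    """Return a folium map with a heat layer drawn from [[lat, lon], ...] pairs."""
    # Imported lazily so map_center can be tried without folium installed.
    import folium
    from folium.plugins import HeatMap

    chicago_map = folium.Map(location=map_center(coords), zoom_start=zoom_start)
    HeatMap(coords, radius=radius).add_to(chicago_map)
    return chicago_map


# Invented coordinates, only to show the centering step.
print(map_center([[41.0, -87.0], [43.0, -89.0]]))
```

In a notebook, evaluating `build_heatmap(coords)` as the last expression of a cell displays the map inline; calling `.save("heatmap.html")` on the result writes it to a standalone file.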

Create the query using fuzziness

Now let's try to retrieve documents using ElasticSearch fuzziness.

The fuzzy query generates all possible matching terms that are within the maximum edit distance specified in fuzziness.

For information about the syntax and semantics of fuzziness, you can read the docs at the following URL: fuzziness
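To make "edit distance" concrete, here is a small sketch that computes the plain Levenshtein distance between two terms. It is an illustration of the idea and not ElasticSearch's implementation. One known difference is that ElasticSearch by default also counts swapping two adjacent characters as a single edit, whereas plain Levenshtein counts it as two.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("children", "childrens"))   # one inserted letter
print(edit_distance("children", "children's"))  # apostrophe plus "s"
print(edit_distance("children", "childern"))    # two swapped letters
```

With a fuzziness of 1, only terms at distance 1 or less from the query term can match, so a term two edits away from the query falls outside the match set.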

Experiment #2: We will first build our query with the parameters:

  1. "query": "Children",
  2. "fuzziness": "1",

Experiment #3: Let's now build our query with the parameters:

  1. "query": "Children's",
  2. "fuzziness": "1",
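The two parameter sets above could be wrapped into a query body along these lines, using a `match` clause with the `fuzziness` option. The Facility Type field, the "Fail" filter on Results, and the helper name are assumptions for illustration; the assignment's actual query may search different or additional fields.

```python
def build_fuzzy_query(field, term, fuzziness="1", results="Fail"):
    """Build a query body: inspections with the given result whose `field` fuzzily matches `term`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"Results": results}},
                    {"match": {field: {"query": term, "fuzziness": fuzziness}}},
                ]
            }
        }
    }


# Same structure for both fuzzy experiments; only the search term changes.
query_children = build_fuzzy_query("Facility Type", "Children")
query_childrens = build_fuzzy_query("Facility Type", "Children's")
print(query_children["query"]["bool"]["must"][1])
```

Keeping the two experiments identical except for the search term makes the difference in their result counts attributable to the term alone.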

Frequent Violators:

Despite the fact that the city of Chicago has the department of Business Affairs and Consumer Protection to revoke business licenses to protect consumers, it appears many businesses with frequent violations have obtained new licenses under the same DBA name.


Experiment #4: Let's get the list of top frequent violators:

Facilities that serve children can be classified under different Facility Types:

  1. Daycare Above and Under 2 Years
  2. Children's Services Facility
  3. Daycare (2 - 6 Years)

We will use ElasticSearch and Folium to plot on the map those high-risk facilities that failed inspection at least 5 times.
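The counting step can be sketched client-side as follows, once the matching records have been retrieved. The field names "DBA Name", "Facility Type", "Results", and "Risk", the value "Fail", and the "High" substring test on Risk are assumptions about how the dataset encodes these values. The toy records are invented.

```python
from collections import Counter

# Facility types listed above for facilities that serve children.
CHILD_FACILITY_TYPES = {
    "Daycare Above and Under 2 Years",
    "Children's Services Facility",
    "Daycare (2 - 6 Years)",
}


def frequent_violators(records, min_failures=5):
    """Return [(name, failures), ...] for child facilities with enough high-risk failures."""
    failures = Counter(
        record.get("DBA Name", "")
        for record in records
        if record.get("Facility Type") in CHILD_FACILITY_TYPES
        and record.get("Results") == "Fail"
        and "High" in str(record.get("Risk", ""))
    )
    return [(name, n) for name, n in failures.most_common() if n >= min_failures]


def toy_record(name, risk="Risk 1 (High)", results="Fail"):
    """Build one invented inspection record for the demonstration below."""
    return {"DBA Name": name, "Facility Type": "Daycare (2 - 6 Years)",
            "Results": results, "Risk": risk}


toy_records = (
    [toy_record("Toy Daycare A")] * 5                            # 5 high-risk failures
    + [toy_record("Toy Daycare B")] * 4                          # only 4 failures
    + [toy_record("Toy Daycare C", risk="Risk 2 (Medium)")] * 5  # not high risk
)
violators = frequent_violators(toy_records)
print(violators)
```

The same grouping could probably be pushed to the server with a `terms` aggregation and its `min_doc_count` option; counting in Python is simply easier to inspect in a notebook.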

Loopholes

As you might have guessed by now, it must be really cheap to do so: those frequent violators re-obtain a business license multiple times under the same business name for only a $165 application fee, based on the official numbers published on the City of Chicago - Business Licensing page.

And it appears the city of Chicago is willing to rubber-stamp the approval of the application for only $165, rather than imposing a very simple rule: 3 strikes and you are out.


Requirements

The PDF document you are submitting must have the source code and the output for the following four requirements.

Requirement #1:

Provide your comparative analysis of the results obtained from the 3 experiments you executed above.

The first experiment, using regex, returned 601 results. Experiments 2 and 3 returned significantly fewer results (141 for experiment 2 and 145 for experiment 3). For experiments 2 and 3 we only used "fuzziness" with a value of one, which means a term could differ from the one we entered by at most one edit and still match. By comparison, experiment 1 used regex with "*" at the beginning and end of "Children", which matches any characters in any quantity, so it could match terms that are many edits away from the search term. For example, a record containing "grandchildren" could have been a result in experiment 1, whereas experiments 2 and 3 would not have returned it (these ElasticSearch queries were also not case sensitive). Interestingly, after only briefly looking at all the results, I believe the field that matched "children" most often was Violations.
